Analysis of Vehicle Collisions in Canada

Jas Sohi

April 1, 2015

Goal

Data Wrangling

The original dataset is enormous (several million records), so I chose to subset it and only look at the most recent year (2011). This new raw dataset contains just over 300,000 records.

Each record represents one collision report per person involved in a collision in Canada. So if there were 4 people involved in a collision there would be 4 separate records.
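A minimal sketch of this subsetting step. The name `collisions_all` is hypothetical and the toy rows stand in for the real file; only the `C_YEAR` filter reflects what is described above.

```r
# Toy stand-in for the full dataset; `collisions_all` is a hypothetical name.
collisions_all <- data.frame(
  C_YEAR = c(2009, 2010, 2011, 2011, 2011),
  C_SEV  = c(2, 2, 1, 2, 2),
  P_ID   = c("01", "01", "01", "02", "01")
)

# Keep only the records for the most recent year.
df_2011 <- collisions_all[collisions_all$C_YEAR == 2011, ]

# One row per person involved, so a collision with two people gives two rows.
nrow(df_2011)
ncol(df_2011)
```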

Number of records, variables

## [1] 301724
## [1] 23

Variable names

##  [1] "X"      "C_YEAR" "C_MNTH" "C_WDAY" "C_HOUR" "C_SEV"  "C_VEHS"
##  [8] "C_CONF" "C_RCFG" "C_WTHR" "C_RSUR" "C_RALN" "C_TRAF" "V_ID"  
## [15] "V_TYPE" "V_YEAR" "P_ID"   "P_SEX"  "P_AGE"  "P_PSN"  "P_ISEV"
## [22] "P_SAFE" "P_USER"

Data Dictionary (.doc)

Univariate Analysis

What is the structure of your dataset?

Structure of the data

## 'data.frame':    301724 obs. of  23 variables:
##  $ X     : int  4598867 4598868 4598869 4598870 4598871 4598872 4598873 4598874 4598875 4598876 ...
##  $ C_YEAR: int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ C_MNTH: Factor w/ 13 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ C_WDAY: Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ C_HOUR: Factor w/ 25 levels "00","01","02",..: 11 13 1 18 18 23 17 17 14 14 ...
##  $ C_SEV : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ C_VEHS: Factor w/ 21 levels "01","02","03",..: 1 1 1 2 2 1 2 2 2 2 ...
##  $ C_CONF: Factor w/ 20 levels "01","02","03",..: 2 4 3 7 7 4 16 16 16 16 ...
##  $ C_RCFG: Factor w/ 12 levels "01","02","03",..: 3 12 12 12 12 5 2 2 2 2 ...
##  $ C_WTHR: Factor w/ 9 levels "1","2","3","4",..: 1 1 7 1 1 1 4 4 1 1 ...
##  $ C_RSUR: Factor w/ 11 levels "1","2","3","4",..: 3 5 3 1 1 3 3 3 1 1 ...
##  $ C_RALN: Factor w/ 8 levels "1","2","3","4",..: 2 1 1 1 1 1 4 4 1 1 ...
##  $ C_TRAF: Factor w/ 19 levels "01","02","03",..: 17 19 19 17 17 17 15 15 17 17 ...
##  $ V_ID  : Factor w/ 60 levels "01","02","03",..: 1 1 1 1 2 1 1 2 1 2 ...
##  $ V_TYPE: Factor w/ 20 levels "01","05","06",..: 1 1 1 1 1 1 1 1 1 14 ...
##  $ V_YEAR: Factor w/ 81 levels "1910","1911",..: 81 81 81 81 81 81 81 81 81 81 ...
##  $ P_ID  : Factor w/ 65 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ P_SEX : Factor w/ 4 levels "F","M","N","U": 2 1 1 1 2 1 2 1 1 2 ...
##  $ P_AGE : Factor w/ 101 levels "01","02","03",..: 75 21 34 50 63 26 34 20 80 50 ...
##  $ P_PSN : Factor w/ 16 levels "11","12","13",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ P_ISEV: Factor w/ 5 levels "1","2","3","N",..: 2 2 2 2 1 2 2 2 2 1 ...
##  $ P_SAFE: Factor w/ 10 levels "01","02","09",..: 8 2 2 2 8 2 2 2 2 8 ...
##  $ P_USER: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 6 ...

Summary of the data

##        X               C_YEAR         C_MNTH           C_WDAY     
##  Min.   :4598867   Min.   :2011   07     : 29320   5      :49913  
##  1st Qu.:4674298   1st Qu.:2011   08     : 29203   4      :46053  
##  Median :4749728   Median :2011   01     : 28799   3      :44428  
##  Mean   :4749728   Mean   :2011   06     : 28194   6      :44106  
##  3rd Qu.:4825159   3rd Qu.:2011   09     : 26521   2      :42602  
##  Max.   :4900590   Max.   :2011   05     : 24760   1      :39503  
##                                   (Other):134927   (Other):35119  
##      C_HOUR           C_SEV           C_VEHS           C_CONF     
##  16     : 26576   Min.   :1.000   02     :184411   21     :89896  
##  15     : 25671   1st Qu.:2.000   01     : 66643   35     :39812  
##  17     : 25541   Median :2.000   03     : 36476   06     :25445  
##  14     : 20397   Mean   :1.983   04     :  9852   36     :23836  
##  12     : 19618   3rd Qu.:2.000   05     :  2409   33     :21836  
##  18     : 19320   Max.   :2.000   06     :   813   QQ     :18475  
##  (Other):164601                   (Other):  1120   (Other):82424  
##      C_RCFG           C_WTHR           C_RSUR           C_RALN      
##  02     :140705   1      :208146   1      :196596   1      :219870  
##  01     :110967   2      : 33011   2      : 55230   2      : 25567  
##  UU     : 22287   3      : 29662   5      : 16788   U      : 19823  
##  03     : 13133   4      : 19736   3      : 13319   3      : 18336  
##  QQ     :  8985   6      :  4313   Q      : 10257   4      : 10160  
##  05     :  2746   U      :  4079   4      :  4846   5      :  4163  
##  (Other):  2901   (Other):  2777   (Other):  4688   (Other):  3805  
##      C_TRAF            V_ID            V_TYPE           V_YEAR      
##  18     :156071   01     :160208   01     :245756   2007   : 21771  
##  01     : 84044   02     :109102   NN     : 13503   2005   : 20627  
##  03     : 32008   03     : 15048   14     :  7464   2008   : 19684  
##  UU     : 13609   99     : 12819   17     :  6469   2003   : 19357  
##  04     :  4687   04     :  3161   06     :  5806   2006   : 19314  
##  QQ     :  4052   05     :   706   07     :  5275   2010   : 18279  
##  (Other):  7253   (Other):   680   (Other): 17451   (Other):182692  
##       P_ID        P_SEX          P_AGE            P_PSN        P_ISEV    
##  01     :216999   F:126255   UU     : 23368   11     :200139   1:119265  
##  02     : 57349   M:157804   19     :  8027   13     : 41526   2:155761  
##  03     : 16713   N:   839   18     :  7870   23     : 12746   3:  2006  
##  04     :  6289   U: 16826   20     :  7668   99     : 11664   N: 17489  
##  05     :  1989              21     :  7379   21     : 10664   U:  7203  
##  NN     :   617              17     :  6865   UU     :  6990             
##  (Other):  1768              (Other):240547   (Other): 17995             
##      P_SAFE       P_USER    
##  02     :211675   1:185197  
##  UU     : 31933   2: 74796  
##  NN     : 30159   3: 12819  
##  01     :  8907   4:  6469  
##  13     :  7626   5:  7464  
##  09     :  7535   U: 14979  
##  (Other):  3889

There are three kinds of variables in this dataset: collision-level, vehicle-level, and person-level.

COLLISIONS

From an initial look at the summary output, Friday (5) appears to be the weekday with the most reported collisions, and the 4-5 PM hour (16) has the most collisions of any hour. Most collisions involve 2 vehicles (C_VEHS). Rear-end collisions (21) are the most common type of accident (C_CONF), and most accidents happen at an intersection (02) (C_RCFG). Nothing out of the ordinary so far. However, perhaps surprisingly, most accidents occur on clear and sunny days (1) (C_WTHR) by a wide margin, and on dry, normal (1) (C_RSUR), straight and level (1) (C_RALN) road surfaces. Finally, most collisions occur where no traffic controls (such as stop signs, police officers, or reduced speed zones) are present (18) (C_TRAF).

VEHICLES

Most collisions involved light duty vehicles such as passenger cars, vans, and light duty pickup trucks (01) (V_TYPE). Most collisions involve more recent model vehicles, but there isn’t a large variance between years (V_YEAR).

PEOPLE

Men are involved in more collisions than women. Many people don’t report their age (UU = Unknown) (P_AGE), but among those who do, the most common ages are the late teens and early twenties (17-21 years old). About two thirds of the records are for drivers (11) rather than passengers (P_PSN). Only 0.66% of records involved a fatality (3), while the rest were split mostly between injuries and no injuries (P_ISEV). In the vast majority of records, a safety device such as a seat belt or helmet (for motorcyclists, bicyclists, and snowmobile and ATV riders) was worn (02) (P_SAFE). The dataset also covers different categories of people: the vast majority are vehicle drivers (1) and passengers (2), with pedestrians (3) a distant third (P_USER).
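The percentages quoted above can be checked directly from the counts printed in the summary output. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```r
# Counts copied from the summary(df) output above.
n_records <- 301724
p_isev <- c("1" = 119265, "2" = 155761, "3" = 2006, "N" = 17489, "U" = 7203)
n_drivers <- 200139  # records with P_PSN == "11"

# Percentage of person records in each P_ISEV category.
pct_isev <- round(100 * p_isev / n_records, 2)
pct_isev["3"]   # share of records coded as a fatality

# Share of records that come from drivers, in percent.
pct_drivers <- round(100 * n_drivers / n_records, 1)
pct_drivers
```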

Univariate Plots Section

Initial Age Plot

Looks like there is something unusual with this plot towards the right tail.

##     Var1  Freq
## 100   NN  1029
## 101   UU 23368
##    UU    19    18    20    21    17    22    23    25    24 
## 23368  8027  7870  7668  7379  6865  6661  6315  6062  5999

Further data wrangling

Looks like for a lot of records the age was either not reported (UU) or not applicable (NN) (e.g. a “dummy” person record created for a parked car). We’ll go ahead and remove these records from the dataset and also clean up the gender variable to show only males and females.

We are now looking at 275270 records instead of just over three hundred thousand from the original data. We got rid of under 10% of the records and it shouldn’t affect our ultimate analysis.
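A minimal sketch of this cleaning step, assuming the unknown and not-applicable codes are exactly the ones visible in the output above ("UU", "NN" for age; "U", "N" for sex). The toy data frame is illustrative only.

```r
# Toy stand-in for the person-level columns; values are illustrative only.
df <- data.frame(
  P_AGE = factor(c("19", "UU", "46", "NN", "07", "33")),
  P_SEX = factor(c("M", "F", "F", "N", "F", "U"))
)

# Keep rows with a known age and a sex of "M" or "F".
keep <- !(df$P_AGE %in% c("UU", "NN")) & df$P_SEX %in% c("M", "F")
df_clean <- droplevels(df[keep, ])  # droplevels() removes the now-unused codes

levels(df_clean$P_SEX)

# Share of records removed in the real data, using the counts quoted above.
removed_pct <- round(100 * (301724 - 275270) / 301724, 1)
removed_pct
```

Dropping the unused factor levels matters for later plots, which would otherwise still show empty "U" and "N" categories.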

Second Age Plot

This chart now shows the distribution of ALL people involved in collisions, and we can see that it is bimodal, with one peak around the late teens and early twenties and another that looks to be around people in their 40s. Let’s confirm that.

From this chart (age > 15), we can see more precisely that 45-50 is where the second peak is. But, as per the original goal of this analysis, we are really interested in the age of Drivers only.

Driver’s Age

We do see a similar distribution when considering drivers only. We now see that the left tail of the distribution is flat, as we don’t expect many children to be driving (at least in Canada!).

Gender of the Driver

Are more men or women behind the wheel before an accident?

Contrary to this reddit joke (in bad taste, I may add), men are actually involved in more collisions than women overall.

Deadly Crashes

Is this difference between the sexes still present when we look at deadly crashes only?

Yes, and the contrast is much more pronounced. There are relatively few deaths per year (thank god!), but in 3 out of every 4 collisions that do involve a death, a man was driving the vehicle.
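One way to compute such a split is to tabulate sex within the fatal collisions only. The sketch below uses a small toy data frame (the values are illustrative, not taken from the dataset) and assumes C_SEV == 1 marks a fatal collision, as used later in this report.

```r
# Toy driver records; values are illustrative, not taken from the dataset.
drivers <- data.frame(
  P_SEX = c("M", "M", "F", "M", "F", "M", "F", "M"),
  C_SEV = c(1, 2, 2, 1, 1, 2, 2, 1)   # 1 = fatal collision, 2 = non-fatal
)

# Sex split among all drivers.
prop.table(table(drivers$P_SEX))

# Sex split among drivers in fatal collisions only.
fatal <- drivers[drivers$C_SEV == 1, ]
fatal_share <- prop.table(table(fatal$P_SEX))
fatal_share
```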

Time

Let’s shift gears (no pun intended) and look at collisions in regards to time.

What day of the week has the most accidents?

Looks like Friday (5) has the most accidents (an increasing trend from Monday; perhaps people are in a rush to celebrate the weekend!) and Sunday has the fewest accidents (this could be because very few people work on Sundays, so there might be fewer vehicles commuting to work overall).
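These observations can be cross-checked against the weekday counts in the summary output. A small sketch, with made-up variable names; note that summary() lumps the remaining levels (Sunday and unknown) into "(Other)".

```r
# Weekday counts copied from the summary(df) output (1 = Monday ... 6 = Saturday).
wday_counts <- c("1" = 39503, "2" = 42602, "3" = 44428,
                 "4" = 46053, "5" = 49913, "6" = 44106)
other <- 35119  # "(Other)": the remaining levels combined

# Weekday with the most records.
busiest <- names(which.max(wday_counts))
busiest

# Monday-to-Friday counts rise from each day to the next.
rising <- all(diff(wday_counts[c("1", "2", "3", "4", "5")]) > 0)

# Even if every "(Other)" record were a Sunday, Sunday would still be lowest.
sunday_lowest <- other < min(wday_counts)
```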

Looks like deadly collisions happen the most on Saturdays (6) as opposed to Fridays (5) for non-deadly crashes.

What time of day do most collisions occur?

Not surprisingly, most collisions occur during 4:00-5:00 PM (many people are heading home after work during this hour).

For deadly collisions, a larger proportion occur during the sleeping hours (from midnight to before the start of the workday).

What month has the most accidents?

Looks like January (01) and the summer months (07 & 08) have the most reported collisions (it could be that more people are driving in the summer for the holidays).

Does this change when we only look at drivers?

Not really. The absolute counts are lower, but the distribution is quite similar.

Types of People Reporting Collisions

Speaking of drivers, we haven’t yet discussed how much of this dataset consists of collision reports from drivers as opposed to passengers, pedestrians, etc.

The vast majority of this dataset consists of reports from drivers (1) and passengers (2).

What’s the weather like when there are accidents?

Perhaps counterintuitively, most accidents occur on sunny days (1). At first, you might have thought that more accidents occur in bad weather.

Weather for motorcyclists vs drivers

A larger proportion of the accidents that motorcyclists are involved in occur on sunny days (1) than is the case for motor vehicle drivers (this could simply be because a lot of motorcyclists ride only in the spring and summer months, with better weather conditions).

Number of vehicles involved in a collision

Most collisions involve, not surprisingly, 2 vehicles.

Vehicle Age

## [1] 2

Let’s clean up the plot a bit and only look at vehicles manufactured after 1980.

## [1] 2

There doesn’t seem to be any clear trend other than that most vehicles involved in accidents are recent models (it could simply be that there are more newer vehicles on the road). Looks like a dead end here (pun intended!).

Vehicle Type

Okay, so now the type of vehicle you are driving has to be a factor in determining whether deaths are involved in collisions right? Let’s see.

Looks like the vast majority of vehicles involved in accidents are Light Duty Vehicles (1).

## [1] 2

Remember that we are now looking at drivers only; we see that a larger proportion of Trucks (07) & Road Tractors (08) are involved in deadly crashes. This may be one variable to help predict deadly crashes.

Univariate Analysis (continued)

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the data set are C_SEV (the severity of the collision: 1 if at least one person died, 2 otherwise) and P_AGE (the age of the person involved in the collision). I suspected that driver’s age and some other combination of variables could help predict deadly collisions.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

In addition to the age of the driver, I originally thought that vehicle age (V_YEAR), vehicle type (V_TYPE), and gender (P_SEX) would be larger factors in determining fatality in collisions. The plots support this initial hunch except for vehicle age, which doesn’t seem to vary between deadly and non-deadly collisions. I will look further at age in the bivariate analysis to see if there is a difference between deadly and non-deadly collisions.

Did you create any new variables from existing variables in the dataset?

Yes, I did create new variables just so that the data could be subset since the original variables were factors (categorical) and could not be subsetted otherwise (P_AGE_NUM & V_YEAR_NUM). I didn’t want to directly convert the original columns in case I needed the factor variables at a later time.
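When creating numeric copies of factor columns like these, the conversion route matters. The sketch below is a general illustration of how R behaves, not the exact code used for P_AGE_NUM and V_YEAR_NUM; the `hour` vector is a made-up example.

```r
# A factor with the same kind of zero-padded codes as P_AGE or C_HOUR.
hour <- factor(c("00", "13", "23", "UU"))

# as.numeric() on a factor returns the internal level codes (1, 2, 3, ...).
codes <- as.numeric(hour)
codes

# Converting via as.character() recovers the printed values; non-numeric
# codes such as "UU" become NA (R emits a coercion warning, silenced here).
hour_num <- suppressWarnings(as.numeric(as.character(hour)))
hour_num
```

Keeping the original factor column alongside the numeric copy, as described above, makes it easy to check which rows turned into NA.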

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, I did adjust the form of the data to show only the most recent year of data as there were millions of total records in the initial dataset. As mentioned earlier, I subset 300,000 records and then I removed data that had unknown or missing gender values as I suspected that the sex of the driver in fatal collisions could be skewed towards one sex (as confirmed by a histogram plot).

Bivariate Analysis

Further Data Wrangling

Before we get to bivariate plots, let’s do some additional data wrangling.

##                X C_SEV P_AGE_NUM V_YEAR_NUM
## X           1.00 -0.02      0.01         NA
## C_SEV      -0.02  1.00     -0.02         NA
## P_AGE_NUM   0.01 -0.02      1.00         NA
## V_YEAR_NUM    NA    NA        NA          1

This initial correlation table with the current numeric variables doesn’t show us much valuable info. Let’s see what other variables can be considered numeric so we can create a more useful correlation table.

# NOTE: as.numeric() on a factor returns its integer level codes, not the
# printed labels, so any non-numeric level (e.g. an unknown code) also gets
# a number rather than NA.
df$C_MNTH_NUM <- as.numeric(df$C_MNTH)  # month
df$C_WDAY_NUM <- as.numeric(df$C_WDAY)  # day of the week
df$C_HOUR_NUM <- as.numeric(df$C_HOUR)  # hour; level "00" maps to code 1
df$C_VEHS_NUM <- as.numeric(df$C_VEHS)  # number of vehicles
df$C_WTHR_NUM <- as.numeric(df$C_WTHR)  # higher value = worse weather
df$C_RSUR_NUM <- as.numeric(df$C_RSUR)  # higher value = worse road conditions
# Create a new variable, M (Mortality): 1 = Died, 0 = Survived
df$M <- ifelse(df$C_SEV == 1, 1, 0)
# However, some 0s are unknown values that need to be removed

We have created 6 new numeric variables from the original factor variables and a new binary variable called M (Mortality). 1 means at least one person died immediately after the collision or went to the hospital and died as a result of his/her injuries. 0 means everyone involved survived the crash. However, this new M variable is based on the C_SEV variable we looked at, and there may be some unknown values that still need to be removed.

There happen to be no such unknown records for 2011. We will also further subset the data to only look at drivers since that’s what our original question asked: what features (related to drivers, their vehicles, and the collision itself) are common in accidents that result in fatalities?
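A sketch of how these two steps could be written so that they still behave correctly for a year that does contain unknown severities. It makes two assumptions that are not confirmed by the text above: that drivers are identified by P_USER == "1", and that an unknown severity would appear as the code "U" (C_SEV is stored as text here purely to illustrate that case).

```r
# Toy records; values are illustrative only.
# Assumptions: P_USER == "1" marks drivers, "U" marks an unknown severity.
df <- data.frame(
  C_SEV  = c("1", "2", "2", "U", "1", "2"),
  P_USER = c("1", "1", "2", "1", "3", "1")
)

# Drop records with unknown severity before deriving the flag, so that
# unknowns are not silently counted as survivals.
df <- df[df$C_SEV %in% c("1", "2"), ]

# M (Mortality): 1 = collision with at least one death, 0 = otherwise.
df$M <- ifelse(df$C_SEV == "1", 1, 0)

# Keep drivers only.
drivers <- df[df$P_USER == "1", ]
table(drivers$M)
```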

Now we have reduced the records by just under 100,000 (175925 records) and we can look at the correlation matrix now.

Bivariate Plots

##                X C_YEAR C_SEV P_AGE_NUM V_YEAR_NUM C_MNTH_NUM C_WDAY_NUM
## X           1.00     NA -0.02      0.01         NA       1.00       0.08
## C_YEAR        NA      1    NA        NA         NA         NA         NA
## C_SEV      -0.02     NA  1.00     -0.02         NA      -0.02      -0.02
## P_AGE_NUM   0.01     NA -0.02      1.00         NA       0.02      -0.04
## V_YEAR_NUM    NA     NA    NA        NA          1         NA         NA
## C_MNTH_NUM  1.00     NA -0.02      0.02         NA       1.00       0.00
## C_WDAY_NUM  0.08     NA -0.02     -0.04         NA       0.00       1.00
## C_HOUR_NUM  0.03     NA  0.01     -0.01         NA       0.03       0.00
## C_VEHS_NUM -0.04     NA -0.02      0.02         NA      -0.04      -0.03
## C_WTHR_NUM -0.08     NA -0.01     -0.03         NA      -0.08       0.01
## C_RSUR_NUM -0.17     NA -0.01     -0.05         NA      -0.17       0.01
## M           0.02     NA -1.00      0.02         NA       0.02       0.02
##            C_HOUR_NUM C_VEHS_NUM C_WTHR_NUM C_RSUR_NUM     M
## X                0.03      -0.04      -0.08      -0.17  0.02
## C_YEAR             NA         NA         NA         NA    NA
## C_SEV            0.01      -0.02      -0.01      -0.01 -1.00
## P_AGE_NUM       -0.01       0.02      -0.03      -0.05  0.02
## V_YEAR_NUM         NA         NA         NA         NA    NA
## C_MNTH_NUM       0.03      -0.04      -0.08      -0.17  0.02
## C_WDAY_NUM       0.00      -0.03       0.01       0.01  0.02
## C_HOUR_NUM       1.00       0.01      -0.03      -0.04 -0.01
## C_VEHS_NUM       0.01       1.00       0.02      -0.02  0.02
## C_WTHR_NUM      -0.03       0.02       1.00       0.43  0.01
## C_RSUR_NUM      -0.04      -0.02       0.43       1.00  0.01
## M               -0.01       0.02       0.01       0.01  1.00

There don’t seem to be any linear relationships here, except for a decent correlation of 0.43 between Weather & Road Surface (C_WTHR_NUM & C_RSUR_NUM).
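One plausible reading of the NA entries in the table above is that V_YEAR_NUM contains missing values and C_YEAR is constant (every record is from 2011). The sketch below reproduces both effects on toy data; it is an illustration of how cor() behaves, not a re-run of the analysis.

```r
# Toy numeric columns; values are illustrative only.
toy <- data.frame(
  C_YEAR     = c(2011, 2011, 2011, 2011, 2011),  # constant column
  P_AGE_NUM  = c(19, 46, 33, 71, 25),
  V_YEAR_NUM = c(2007, NA, 1998, 2010, 2003)     # contains a missing value
)

# Default behaviour: a single NA in a column makes the correlation NA.
r_default <- cor(toy$P_AGE_NUM, toy$V_YEAR_NUM)

# Using only the complete pairs gives a usable estimate.
r_pairwise <- cor(toy$P_AGE_NUM, toy$V_YEAR_NUM, use = "pairwise.complete.obs")

# A constant column has zero variance, so its correlation is NA
# (R also warns that the standard deviation is zero; silenced here).
r_const <- suppressWarnings(cor(toy$C_YEAR, toy$P_AGE_NUM))
```

Passing `use = "pairwise.complete.obs"` to the full correlation matrix would fill in the V_YEAR_NUM row and column, while the C_YEAR entries would stay NA and could simply be dropped.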

No clear relationship. Looks like there are quite a few accidents when the weather is really bad and when the road surface is at its worst (top right hand corner).

Let’s see if we can get any inspiration to compare other bivariate relationships with a pairwise plot of some selected variables.

Let’s see what other relationships we can explore further.

Drivers in Newer Vehicles or Older Vehicles?

Looks like newer cars are involved in most accidents whether deadly (1) or not (0).

Vehicle Types

Looks like many deadly crashes had no safety device such as a seat belt used (light pink) (this is consistent across the major types of vehicles).

Let’s look at the age distribution of drivers of different types of vehicles.

Nothing out of the ordinary, we can see that school bus drivers (9) and RV drivers (18) skew towards older drivers.

Let’s now explore what traffic controls appear to be more effective than others.

From the top plot it appears as though no controls (18) being present would be the predominant traffic control when deadly accidents occur. But looking at the proportions, it looks as though the largest proportion of deaths actually occur where warning signs (5) are present.
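The difference between the two views comes down to raw counts versus within-group proportions. A toy illustration (made-up values; "18" and "05" are used for no control and warning signs, following the codes mentioned above):

```r
# Toy driver records: 20 with no control ("18"), 4 at warning signs ("05").
d <- data.frame(
  C_TRAF = rep(c("18", "05"), times = c(20, 4)),
  M      = c(rep(c(1, 0), times = c(3, 17)),   # "18": 3 deadly out of 20
             rep(c(1, 0), times = c(2, 2)))    # "05": 2 deadly out of 4
)

# Raw counts: "18" has the most deadly records simply because it is common.
counts <- table(d$C_TRAF, d$M)
counts

# Row-wise proportions: share of deadly records within each control type.
props <- prop.table(counts, margin = 1)
props
```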

Once again we have to remember that we are looking at the difference between deadly and non-deadly crashes.

Age of Deadly Drivers

This cumulative density plot shows teenage drivers and elderly drivers are overrepresented in collision deaths.

Maybe the configuration of the crash and the road has an effect on how deadly a crash is?

Head on collisions (31) make up a huge proportion of deadly crashes.

Railroad crossings (4) and passing lanes (7) are the type of road configurations with the largest proportion of deadly crashes.

Time Revisited - How about if we look at time of day across the different days?

Looks like the distribution across the weekdays is bimodal throughout the “non-sleeping” hours (7 AM to Midnight), while the distribution on the weekend is almost identical in shape for Saturday & Sunday (being nearly unimodal for this same hour range).

Gender

Does age distribution vary across genders?

Not quite, the shape of both age distributions is very similar across sexes.

Bivariate Analysis (continued)

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Railroad crossings (4) and passing lanes (7) are the type of road configurations with the largest proportion of deadly crashes. Head on collisions (31) make up a huge proportion of deadly crashes. Interestingly, warning signs represent the traffic control category that has the largest proportion of deadly crashes.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, there was an interesting relationship between hour and weekday. Looks like the distribution across the weekdays is bimodal throughout the non-sleeping hours (7 AM-Midnight), while the distribution on the weekend is almost identical in shape for Saturday & Sunday (being nearly unimodal for this same hour range).

What was the strongest relationship you found?

The strongest relationship I found was between Weather (C_WTHR_NUM) and Road Surface (C_RSUR_NUM). There was a correlation of 0.43 between these two features. However, when plotting these two discrete values, no interesting trend was observed.

Multivariate Plots Section

No clear trend can be observed when plotting vehicle year, person’s age, and sex except what was seen before in an earlier plot that the vast majority of vehicles involved in collisions are newer vehicles.

It’s interesting to see that the age distribution of school bus drivers (09) and city bus drivers (11) is about the same (about 25 to 75). However, school bus drivers have about the same number of women and men involved in accidents (women slightly more), while city bus drivers in accidents are predominantly male.

Another dead end: no visible relationship between month, vehicle type, and mortality (deadly or non-deadly crashes).

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The only real observation was that the age distribution of school bus drivers (09) and city bus drivers (11) is about the same (about 25 to 75). However, school bus drivers have about the same number of women and men involved in accidents (women slightly more), while city bus drivers in accidents are predominantly male.

Were there any interesting or surprising interactions between features?

Nothing surprising, but I did expect to see more multivariate relationships than what I found.


Final Plots and Summary

Plot One

From the top plot we can see that the distribution of driver’s age is bimodal (1st peak at age 19 and 2nd peak at age 46). 19 year olds are indeed involved in the most accidents overall, but 18 year olds (*) are the ones involved in the most deadly collisions (lower plot). We can also see that men and women are about equal in terms of being involved in accidents. However, when we look at deadly collisions only (lower plot), we see that men are overrepresented.

Plot Two

We can see that light duty vehicles (1) represent the largest proportion of all collisions (including deadly collisions). A large proportion of deadly accidents also occurred when no safety device (RED), such as a seatbelt, was used. This is consistent across vehicle types.

Plot Three

Head-on collisions (31) are the deadliest type of collision for many vehicle types, which is not a surprise, but it’s interesting to see that tractors in particular have a large proportion of deadly head-on collisions. You can also see that the vast majority of accidents involve cars (or other light duty vehicles).


Reflection

I wasn’t able to satisfactorily answer the original question I posed. The plots show that certain variables have high proportions in deadly crashes, which is why I suspected them of being predictive, but I still need statistical evidence that these differences are not just due to chance.
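One standard way to get such evidence for two categorical variables is a chi-squared test of independence. This was not part of the analysis above; the sketch below only shows the shape of such a check, and the counts are made up for illustration (they are not taken from the dataset).

```r
# Made-up counts for illustration only:
# rows = safety device used or not, columns = non-deadly (0) / deadly (1).
tab <- matrix(c(900, 20,
                120, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(P_SAFE = c("used", "not used"),
                              M      = c("0", "1")))

# Pearson's chi-squared test of independence between the two factors.
res <- chisq.test(tab)
res$p.value   # a small p-value suggests the association is unlikely to be chance
```

Because each collision contributes one record per person, records from the same collision are not independent, so a test like this would be better run on one record per collision or per driver.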

The features I would initially choose if I were building a machine learning model to predict deadly crashes are: P_SEX, P_SAFE, V_TYPE and C_RCFG because from the plots they appear to differ the most between deadly/non-deadly crashes.

Once I complete the next machine learning course, I will be able to see which classification model would be appropriate. What I did learn through my exploratory analysis is that a linear model would not be appropriate in this case.

Here are some valuable things I learned throughout this process:

Reuse of Code

I learned that I would need to be very careful when subsetting data from a previous year in order to make the code reusable. For example, I didn’t need to remove unknown values for C_SEV because there were none for 2011, but that’s not necessarily true for other years. It is best to have the code in there assuming that the data has these missing values (even if they aren’t missing in the particular dataset you are currently working on).
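A sketch of what such defensive code might look like. The helper `drop_unknown` is hypothetical, and its default list of unknown codes ("U", "UU") is an assumption based on the codes visible in the summary output; other codes can be passed in as needed.

```r
# Hypothetical helper: drop rows where any of the given columns holds an
# "unknown" code. The default codes are an assumption, not a full list.
drop_unknown <- function(data, cols, codes = c("U", "UU")) {
  bad <- Reduce(`|`, lapply(cols, function(col) {
    as.character(data[[col]]) %in% codes
  }))
  droplevels(data[!bad, , drop = FALSE])
}

# Works the same whether or not a given year actually contains unknowns.
toy <- data.frame(C_SEV = c("1", "2", "U", "2"),
                  P_SEX = c("M", "F", "M", "U"))
clean <- drop_unknown(toy, c("C_SEV", "P_SEX"))
nrow(clean)

# A year without unknowns passes through unchanged.
same <- drop_unknown(clean, c("C_SEV", "P_SEX"))
```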

Initial Bias

It’s also tempting to stop exploring the data when you arrive at the answer that you originally expected from intuition. It’s important to keep going, as things may not be as they seem from just the initial look at the data or intuition.

Summary vs Plots

I learned that a lot of the information about the data set could be gleaned just from the summary output (summary(df)). Many of the visualizations helped confirm these initial findings, but more importantly they helped in finding outliers and unusual relationships that would have been missed by looking at the summary alone. For example, the summary alone would not show that older teenagers and elderly drivers are disproportionately involved in deadly collisions.

Speed of Execution

I learned that it’s better to work with a smaller subset of the data in order to speed up execution of the code. For the pairwise plot, I used a sample of 10,000 records, but it still took several minutes to run this code even with less than 10% of the records.
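A minimal sketch of drawing such a sample. The seed and the toy `drivers` data frame are arbitrary stand-ins; only the sample size and the row count come from the text above.

```r
set.seed(42)  # arbitrary seed, only to make the sample reproducible

# Toy stand-in with the same number of rows as the drivers subset.
drivers <- data.frame(id = seq_len(175925),
                      P_AGE_NUM = sample(16:90, 175925, replace = TRUE))

# Draw 10,000 rows without replacement.
idx <- sample(nrow(drivers), 10000)
drivers_sample <- drivers[idx, ]

# Fraction of the data used, in percent.
pct_used <- round(100 * nrow(drivers_sample) / nrow(drivers), 1)
pct_used
```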

Choosing the right plot

It’s very important to choose the right type of plot or else you can get misleading results. For example, a simple scatterplot of weather against road surface suffers from overplotting: it appears to show that wet roads and windy weather are associated with deadly crashes and that all other combinations are associated with non-deadly crashes. But when you use jitter, the pattern is much clearer and you can see that this is not necessarily the case.
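The effect of jitter on discrete codes can be seen even without drawing anything. The sketch below uses base R’s jitter() on toy data; this is a stand-in chosen to avoid assuming a plotting package, and may differ from the plotting code actually used for the report.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Toy discrete codes; many records share the same (weather, surface) pair.
wthr <- sample(1:7, 500, replace = TRUE)
rsur <- sample(1:9, 500, replace = TRUE)

# Without jitter, 500 records collapse onto at most 7 * 9 = 63 distinct points.
n_distinct_raw <- nrow(unique(data.frame(wthr, rsur)))

# jitter() adds uniform noise in [-amount, amount] to each value.
wthr_j <- jitter(wthr, amount = 0.3)
rsur_j <- jitter(rsur, amount = 0.3)
n_distinct_jit <- nrow(unique(data.frame(wthr_j, rsur_j)))

# To draw it: plot(wthr_j, rsur_j, pch = 16, col = rgb(0, 0, 0, 0.3))
c(raw = n_distinct_raw, jittered = n_distinct_jit)
```

With the points spread out, the density of each combination becomes visible instead of the last-drawn point hiding everything underneath it.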

Plotting in RStudio vs HTML Output

Finally, I want to discuss a frustration with plotting in RStudio vs the HTML output. If you use the zoom plot option in RStudio, you expect the final output to look the same. However, although I could change the plot output width and height, the plots didn’t look as good as they did in RStudio.